December 5, 2018 to December 14, 2018

I decided to start a journal this week to keep track of the work I am doing on applying LIME to the Hamby bullet data. I have been trying to decide which input values to use for LIME in the forensic examiners’ paper, and I have decided that I would like to simplify matters even more by only focusing on a few input options. I have considered a handful of ways of answering this question (accuracy within one implementation of lime, accuracy across 10 implementations of lime, etc.), but I would like to reserve some of these for a future, more technical paper. However, I would like a place to store my work for future use, and it would be nice to have a place to work where I do not feel the need to optimize code. I am hoping that this will be a place where I can write drafts of code and create first versions of graphics. Then I can clean up and organize the code when I write the paper.

Prior to this journal, I had been working on all of my ideas in the R markdown document for the paper, but now I am going to transfer a lot of that code to this journal. Many of the sections below include work that I did earlier this semester. I am just transferring them here now. Later I will move the necessary parts to the paper.

Previous Work

Feature Plots of Training and Testing Data Filled by samesource

I created the following two plots of the training data as suggested by Heike. The histograms below show the distributions of the features used in the random forest rtrees. The histograms are filled by the samesource variable, which is the truth of whether or not the comparison is from the same barrel and land. The default histograms make it hard to compare the distributions of the matches and non-matches since there are many more comparisons that have samesource == FALSE.

By setting position = "fill" in the geom_histogram function, it is easier to compare the matches and non-matches. These plots could be used in the future to hand select the bins for lime. Additionally, a logistic regression fit to this data could be used to determine the LC50, LC10, and LC90, which could serve as the bins for lime.
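As a rough illustration of the logistic regression idea: for a one-feature logistic model with log-odds b0 + b1 * x, the feature value at which the predicted match probability equals p is x = (logit(p) - b0) / b1. The Python sketch below computes the LC10, LC50, and LC90 this way. It is not code from the paper; the function name lc_value and the coefficient values are made up for illustration, and in practice the coefficients would come from a model fit to the training data.

```python
import math


def lc_value(p, intercept, slope):
    """Feature value where a one-feature logistic model predicts probability p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if slope == 0:
        raise ValueError("slope must be nonzero")
    logit_p = math.log(p / (1 - p))
    return (logit_p - intercept) / slope


# Hypothetical coefficients, chosen only for illustration (not fitted values).
b0, b1 = -4.0, 8.0

cutoffs = {
    name: lc_value(p, b0, b1)
    for name, p in [("LC10", 0.10), ("LC50", 0.50), ("LC90", 0.90)]
}
```

With a positive slope the three cutoffs are ordered LC10 < LC50 < LC90, so they would split a feature into four candidate bins.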

The plots below have the same structure, but they are created with the Hamby 224 testing data. I chose to not separate the testing data by sets, but this is something that could be done later if necessary.

Correlation Plots of Training and Testing Data

I made these plots to look at the correlation between features in the training data within the TRUE and FALSE cases of samesource. The features are highly correlated for the match comparisons. It is clear that the variables are more correlated within the match comparisons than within the non-match comparisons. However, there are still some variables that are relatively highly correlated within the non-match comparisons.
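The within-group correlations described above can be sketched as follows. This is a minimal Python illustration, not the code used for the plots: the helper name correlation_by_group is hypothetical, and the data are simulated so that the "match" rows share a common signal while the "non-match" rows do not.

```python
import numpy as np


def correlation_by_group(features, group):
    """Return {group value: correlation matrix of the feature columns}."""
    features = np.asarray(features, dtype=float)
    group = np.asarray(group)
    return {
        g.item(): np.corrcoef(features[group == g], rowvar=False)
        for g in np.unique(group)
    }


# Simulated stand-in data (not the Hamby features).
rng = np.random.default_rng(1)
n = 200
shared = rng.normal(size=(n, 1))
match_rows = shared + 0.1 * rng.normal(size=(n, 3))  # three features driven by one signal
nonmatch_rows = rng.normal(size=(n, 3))              # three independent features

X = np.vstack([match_rows, nonmatch_rows])
samesource = np.array([True] * n + [False] * n)

corr = correlation_by_group(X, samesource)
```

Here corr[True] and corr[False] are the two correlation matrices that would be drawn as separate correlation plots.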

The plots below show the correlations for the testing data. The patterns in the plots look very similar to those for the training data.

Visualizations of the LIME Explanations

Heike made a plot with the structure of the one below when we began with the default settings of lime. This one has been created from the lime explanations with 2 quantile bins. It includes all three of the chosen features for each case in the testing dataset. For both sets, the cutoff of 0.275 < ccf occurs the most frequently.

The plot below is the model for the one that will be used in the app for exploring the lime explanations from the bullet matching data. This is the data from set 1 of the testing dataset.

Draft of LIME Procedure from the lime R Package

I wrote this up to start thinking about how to describe the procedure that the lime R package uses to implement the LIME algorithm. It needs a lot of work, but it is a start. The final version of this will end up in the technical stats paper critiquing LIME. The version in the firearm examiner’s paper will be much simpler…

The steps below explain the procedure that the R package is using to apply the LIME algorithm to the bullet matching predictions on the Hamby 224 clone dataset made by the random forest model from Hare. For simplicity, the steps are described as what happens to one case in the test data. Thus, the steps (2) through (7) are repeated for each observation in the testing dataset.

Let \[Y_{jk} = \begin{cases} 1 & \mbox{ if bullets } j \mbox{ and } k \mbox{ were fired from the same gun barrel }\\ 0 & \mbox{otherwise} \end{cases}\] be the response variable in the training dataset, and \(X_1,...,X_9\) correspond to the nine features in the training dataset. Let \(X'_1,...,X'_9\) be the

  1. Distributions for each of the features in the training data are obtained.
    • The method that lime uses to obtain the distribution differs based on the feature type. All of the features in the Hamby datasets are numeric. For numeric features, the default option in lime (quantile_bins = TRUE) computes the quantiles of each feature based on the number of bins selected. The default number of bins is 4 (n_bins = 4).
  2. Many (\(n\)) samples from each of the feature distributions are drawn.
    • To do this, lime has several options (mostly quoted from lime package for now):
    • bin_continuous = TRUE should continuous variables be binned?
    • quantile_bins = TRUE should the bins for n_bins be based on quantiles or spread evenly
    • n_bins = 4 number of bins if bin_continuous is TRUE
    • use_density = TRUE if bin_continuous is FALSE, should continuous data be sampled using kernel density estimation (if not, then will assume normal for continuous variable)
  3. Predictions for the testing data using the random forest model are computed.
    • The random forest model rtrees is used to make a prediction for the observation from the test dataset and each of the \(n=5000\) samples as to whether or not the comparison of the two bullets in the test case is a match. Since the random forest is a classification model, lime is set to return the prediction probabilities.
  4. Similarity scores between the observation in the testing data and each of the \(n=5000\) sampled values are obtained.
    • The way that the similarity score is computed depends on the type of feature. Since all of the features in the Hamby 224 test dataset are continuous, the simulated values are first converted into 0-1 features where a 1 indicates that the feature from the simulated value falls in the same bin as the observed data point and a 0 indicates that the feature is not in the same bin as the observed data point. Then, by default, the Gower distance (computed with the gower package in R) is used to obtain the similarity score.
  5. Feature selection is performed by fitting some type of regression model, weighted by the similarity scores, to the simulated data and the observed value. The 0-1 versions of the features are used.
    • The user can specify the number of features, \(m\), they would like to select to explain the prediction. lime supports the following options for feature selection:
    • forward selection with ridge regression
    • highest weight with ridge regression
    • LASSO model
    • tree model
    • default: forward selection if \(m\le6\) with a ridge regression model, highest weight with ridge regression otherwise
  6. A ridge regression model is fit as the simple model by regressing the prediction probabilities on the \(m\) selected predictor variables and weighted by the similarity scores. If the response is categorical, the user can select how many categories and which categories they want to explain. \[P(Match = TRUE) = \beta_0 + \beta_1 \cdot I\left[X_1 \in \mbox{obs bin}\right] + \beta_2 \cdot I\left[X_2 \in \mbox{obs bin}\right] + \beta_3 \cdot I\left[X_3 \in \mbox{obs bin}\right]\] For the prediction of interest, \[P(Match = TRUE) = \beta_0 + \beta_1 + \beta_2 + \beta_3.\]
  7. The feature weights are extracted and used as the explanations.
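The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the lime package's implementation: the function names, the toy stand-in for the random forest, and the ridge penalty value are all hypothetical; sampling is reduced to resampling the training values; feature selection (step 5) is skipped, so all features are kept; and the observed case itself is not added to the simulated data.

```python
import numpy as np


def quantile_bin_edges(x, n_bins):
    """Step 1: interior cut points that split x into n_bins quantile bins."""
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(x, probs)


def lime_sketch(train, obs, predict, n_bins=4, n_samples=5000,
                ridge_lambda=1.0, seed=0):
    """Simplified LIME-style explanation of one observation."""
    rng = np.random.default_rng(seed)
    n_features = train.shape[1]
    samples = np.empty((n_samples, n_features))
    same_bin = np.empty((n_samples, n_features))
    for j in range(n_features):
        edges = quantile_bin_edges(train[:, j], n_bins)
        # Step 2 (simplified): resample the training values of feature j.
        samples[:, j] = rng.choice(train[:, j], size=n_samples)
        # Step 4a: 1 if the sample falls in the same bin as the observation.
        same_bin[:, j] = (np.digitize(samples[:, j], edges)
                          == np.digitize(obs[j], edges))
    # Step 3: complex-model predictions for the sampled points.
    y = predict(samples)
    # Step 4b: the observation is all ones in 0-1 form, so the Gower distance
    # is the share of features in a different bin; similarity = 1 - distance.
    weights = 1.0 - np.mean(1.0 - same_bin, axis=1)
    # Step 6: weighted ridge regression with an unpenalized intercept.
    X = np.column_stack([np.ones(n_samples), same_bin])
    penalty = ridge_lambda * np.eye(n_features + 1)
    penalty[0, 0] = 0.0
    XtW = X.T * weights
    beta = np.linalg.solve(XtW @ X + penalty, XtW @ y)
    # Step 7: beta[0] is the intercept, beta[1:] are the feature weights.
    return beta


def toy_model(Z):
    """Hypothetical stand-in for the random forest; depends on feature 0 only."""
    return 1.0 / (1.0 + np.exp(-(4.0 * Z[:, 0] - 2.0)))


rng = np.random.default_rng(42)
train = rng.uniform(size=(500, 3))
obs = np.array([0.9, 0.5, 0.5])

beta = lime_sketch(train, obs, toy_model)
local_prediction = beta.sum()  # all indicators equal 1 for the observed case
```

Because the toy model only uses the first feature, that feature should receive the largest weight and the other two should be close to zero. The last line mirrors the equation in step 6, where the simple model's prediction for the case of interest is the sum of the intercept and the feature weights.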

Note: I realized that if bin_continuous = FALSE, then bins are not used at all. Instead, a kernel density estimator is used to sample from the distribution (or a normal distribution if specified), and then the ridge regression models are fit without “numerified” values.

Work on Determining Input Values for LIME

When we began looking at the explanations from lime with the default settings, we did not think that they made sense. This led me to try applying LIME with a handful of input values. However, since LIME is based on random permutations, I was curious to know how consistent the results were. This led me to try running each implementation for specific starting values a handful of times. The next two sections consider the results from these studies.

Assessing the Accuracy of the LIME Results

The figures in this section are created from the implementations of LIME on set 1 from the training data for the input options of 2 to 6 quantile bins, 2 to 6 equally spaced bins, kernel density estimation, and normal distribution approximation. Each set of input values was only run once.

Complex versus Simple Model Predictions

The plot below compares the predictions from the “simple model” (the ridge regression model) and the “complex model” (the random forest rtrees) from the lime implementations with bin estimation (bin_continuous == TRUE). The simple model predictions are on the y-axis, and the complex model predictions are on the x-axis. The plot is faceted by number of bins and whether the bins are equally spaced or based on quantiles. The points are colored by the \(R^2\) value from the fit of the simple model. The lines are linear regression lines fit to the data points within a facet. We would hope to see a linear relationship between these two variables. None of the cases show strong linear trends, but some are more linear than others. With the quantile bins, the simple model never makes a prediction over 0.6, whereas the random forest model can have predictions of up to 1. The equally spaced bins do have probabilities that exceed 0.6, but only with 3 and 6 bins. I noticed that within the facets, the points fall mostly in horizontal strips, and the number of strips is about the number of bins from the lime implementation.

The plot below shows the absolute value of the difference between the complex model prediction and the simple model prediction versus the complex model prediction. The points are faceted by number of bins and whether the bins are equally spaced or based on quantiles. Again, the points are colored by the \(R^2\) values from the simple model. All cases show a v-shaped trend. The low part of the v occurs around a random forest score of about 0.25. That is, the simple model most accurately portrays the complex model predictions around a random forest score of 0.25. It gets worse near the extremes. However, it is interesting to note that with the equally spaced bins, the absolute difference decreases near 1 for 3 to 6 bins. This is not the case with the quantile bins. It appears that the equally spaced bins are able to make slightly better predictions than the quantile bins when there is a high probability of a match.

At some point, I got the idea that it would be interesting to look at the “residuals” (difference in complex and simple model predictions) by feature. The plot below shows one example with the data from 3 equally spaced bins. There are clear trends in the plots, but I am not quite sure what to make of this or how to use it. This may be something to return to.

Comparing Input Values by MSE and \(R^2\)

In order to assess which lime implementation is doing the best job of capturing the predictions from the random forest model, I decided to calculate the mean squared error for each of the input situations. I defined the mean squared error as \[\frac{\sum_{i=1}^n (\hat{p}_{\mbox{simple},i}-\hat{p}_{\mbox{complex},i})^2}{n}\] where \(n\) is the number of observations in the testing dataset (within a set). Additionally, I was curious to compare the fits of the ridge regression models across input values, so I calculated the average \(R^2\) for each input situation.
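The two summaries defined above can be computed as in the short Python sketch below. It is only an illustration: the function names are hypothetical, and the probabilities and \(R^2\) values are made-up numbers, not results from the Hamby data.

```python
import numpy as np


def explanation_mse(p_simple, p_complex):
    """Mean squared difference between simple and complex model predictions."""
    p_simple = np.asarray(p_simple, dtype=float)
    p_complex = np.asarray(p_complex, dtype=float)
    if p_simple.shape != p_complex.shape:
        raise ValueError("prediction vectors must have the same shape")
    return float(np.mean((p_simple - p_complex) ** 2))


def average_r2(r2_values):
    """Average of the per-case R^2 values from the simple model fits."""
    return float(np.mean(np.asarray(r2_values, dtype=float)))


# Made-up values for three test cases, for illustration only.
p_simple = [0.2, 0.4, 0.6]   # ridge regression predictions
p_complex = [0.1, 0.4, 0.9]  # random forest predictions
r2_per_case = [0.30, 0.50, 0.40]

mse = explanation_mse(p_simple, p_complex)
mean_r2 = average_r2(r2_per_case)
```

Note that the two measures answer different questions: the MSE compares the simple model to the complex model only at the observed cases, while each \(R^2\) describes how well one ridge regression fits its own simulated data.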

The plot below shows the mean squared errors for each of the input situations faceted by set. In both cases, the lowest mean squared error occurs with 3 equally spaced bins. This seems to suggest that using 3 equally spaced bins should provide better explanations.

The plot below shows the average \(R^2\) value for each of the input situations faceted by set. The results are very similar for each set, and the highest \(R^2\) values occur for 2 quantile bins. The \(R^2\) values for 3 equally spaced bins are in the middle.

Assessing the Variability of the LIME Results

In order to assess how variable the LIME results are, I ran each of the 12 input situations 10 times. I wanted to look at the variability of the MSE for each situation, and I wanted to see how consistent LIME's choice of selected features is.

The plot below shows boxplots of the 10 mean squared errors computed for each of the input situations for each set. There does not appear to be much variation for most of the cases. The largest amount of variation occurs with 4 and 5 equally spaced bins. There is moderate variability for 3 equally spaced bins, but all of the 10 runs resulted in much lower MSEs than any of the other situations. This still seems to support the choice of using 3 equally spaced bins for both sets.

To assess the consistency of the results, I determined the number of different features chosen by LIME as the top feature within a test case across the 10 replicates. I then computed the average number of different top features chosen by LIME within an input situation for each set. The plot below shows these values as dots. The error bars represent one standard deviation above and below the mean. Note that the number of different features cannot go below 1, but the error bars do. I have decided to ignore this for now, but I would need to come up with a better representation if I used this for something.

There are a handful of situations with a mean near one and very little variation. However, the largest mean is around 2, which is not very high. This seems to suggest that LIME is relatively consistent across runs.
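The consistency summary described above can be sketched as follows. This is a small Python illustration with made-up inputs: the case labels, the feature names other than ccf, and the use of four replicates instead of ten are all hypothetical.

```python
from statistics import mean, stdev


def distinct_top_features(top_by_case):
    """Map each test case to the number of distinct top features across replicates."""
    return {case: len(set(tops)) for case, tops in top_by_case.items()}


# Hypothetical top features for three test cases over four replicates.
top_by_case = {
    "case_1": ["ccf", "ccf", "ccf", "ccf"],
    "case_2": ["ccf", "feature_b", "ccf", "ccf"],
    "case_3": ["feature_c", "ccf", "feature_b", "ccf"],
}

counts = distinct_top_features(top_by_case)

# Mean and standard deviation across test cases for one input situation.
avg_distinct = mean(counts.values())
sd_distinct = stdev(counts.values())
```

A count of 1 for a case means LIME picked the same top feature in every replicate, which is the most consistent outcome possible.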

The bar chart below shows, for each input situation, the proportion of test cases within a set that had each number of different top features chosen. The largest number of different features chosen was four. However, many of the cases only have one or two different features chosen, and usually, a larger proportion of the cases have only one top feature chosen across the 10 reps.

The plot below needs some work, but it does contain some interesting information. Right now, I am only concentrating on set 1. We can see that with some lime settings, all cases have the same “best” variable chosen, and other settings have multiple important variables that depend on the case. We can also see that different variables are selected as important depending on the number of bins. It is also interesting to note that when comparing the quantile bins to the equally spaced bins, this plot shows that the equally spaced bins tend to choose the same first variable for all cases. On the other hand, the quantile bins choose different first variables for the cases.

The fact that different features get chosen as the top feature for different estimation techniques makes sense. You would expect that the number of bins would determine which features are better suited for predicting whether or not a comparison is a match. For example, if you look back at the feature distribution plots, you can see that ccf is the obvious choice for distinguishing between matches and non-matches if you have to choose two equally spaced bins. This is the feature that lime most frequently chooses. You can make similar arguments for other cases as well. Even though different inputs select different variables, I would expect that some variables will be better at prediction than others. I think this is what leads some input values to better LIME explanations and lower MSEs than others.

In the meeting with Heike, we talked about how this plot suggests that LIME is not doing a very good job of providing local explanations. Regardless of the case, it is often suggesting only one of two variables are important, and the variable importance depends on the number of bins. In this plot, I would like consistency in the x-direction, but I would like variability in the y-direction to better understand the variables that are important for a specific case.

I was thinking a bit more about the relationship between the top features chosen and the type of binning used. I realized it would be helpful to view the relationship between the features and the random forest score. The plots below show the testing data with random forest score versus each of the features. The points are colored by the samesource variable.

Deciding on LIME Input Settings

In order to decide on which LIME settings to use for the firearm examiners’ paper, I decided to create a table that ranks each of the 12 input settings by MSE from the single implementation, the average MSE across the 10 implementations, the average number of top features chosen across the 10 implementations, and the average \(R^2\) value from the single implementation. The first table below is for set 1, and the second table below is for set 11.

While using these to help me decide which input settings to use, I had a handful of thoughts:

  • Since I am still working on making sense of the variation results, and it would be nice to get the first paper out, I am thinking of basing the decision of input values entirely off of only the assessment measures from the one case.
  • Even with needing more time to think about how to understand the meaning of the results from the 10 reps, the mean MSE from these has very similar results to the MSEs from the one rep, and it does not appear that the top chosen features vary too much between the reps. This makes me think that it would be okay to base my decision off of the results from the single rep.
  • Both Heike and I do not think that the \(R^2\) is that helpful of a measure for determining whether the LIME explanations are good, so I am not going to put much weight on it.
  • Additionally, I do not want to consider either the normal approximation or kernel density estimate since I have not figured out a good way to visualize these in the app. Right now, the feature plots from these explanations are not very helpful. I would need to adjust the intercept or something.

Based on the MSEs from the single implementation of LIME, I will use 3 equally spaced bins for both the set 1 and set 11 explanations in the firearm examiner paper since it leads to the lowest MSE for both sets.

Notes on meeting with Heike

  • We talked about including the penalty for the number of parameters in the discussion of the paper.
  • We talked about using a tree to choose the bins. This would allow us to automate the process and to obtain a penalty parameter since the trees give us nesting. (MSE + lambda * p where p is the from the number of trees multiplied by the number of …)
  • Look at the AUC after binning

December 15, 2018 to January 8, 2019